1.  INTRODUCTION

In [Kanal (1974)], while commenting on feature-subset selection,
it was mentioned that the possibility of posing many problems in
pattern recognition as graph-searching problems suggested
approaches likely to receive attention in the near future. The
paper also mentioned the prospect of increasing interplay between
pattern recognition and "problem-solving" techniques of Artificial
Intelligence (A.I.) [Nilsson (1971)].

The embedding of the statistical and syntactic models of pattern
recognition theory into the state-space and AND/OR graphs and
search strategies of A.I., and the development of connections
between these various representations, have, as anticipated, led
to new results and a somewhat different way of thinking about
pattern recognition. For example, it has led us to view many
pattern recognition problems in terms of a joint search of paths
in a model space and a data space, with feedback between the two
searches.

Section 2 of this paper states some major limitations of the
standard multivariate statistical approach to pattern
classification and motivates the renewed interest in multiclass
multistage classification. Section 3 categorizes recent
contributions in the literature on multistage decision schemes.
Section 4 outlines general state space representations and ordered
search strategies, and section 5 illustrates how these serve as
theoretical models for multistage multiclass pattern
classification. General state-space graphs for nearest neighbor
(NN) classification are described in section 6, which also
mentions recent results on NN error estimation and bounds using
labelled and unlabelled samples.

The  shortcomings  of  the  usual  linguistic-syntactic  models  are 
listed  in  section  7 and  used  to  motivate  the  problem-reduction 
representations  defined  in  section  8,  which  comments  on  efficient 
feature extraction using problem-reduction and state-space
representations. Section 9 contains concluding remarks on this and
other  research  directions  and  problems.  The  bibliography  guides 
the  reader  to  recent  papers,  surveys,  conference  proceedings 
and  edited  collections  covering  broad  areas  of  pattern  recognition 
methodologies  and  their  applications. 

2.  MULTISTAGE CLASSIFICATION SCHEMES:
RATIONALE FOR RENEWED INTEREST

In various applications of current interest, e.g., biomedical
image recognition and remote sensing, the number of features N is
quite large and the M classes are multimodal and also numerous.
For such problems multistage classification schemes are used more
frequently than "one-shot" classifiers, which give an M-way
decision in a single step and use a common set of features for all
classes.

A straight  casting  of  such  pattern  recognition  problems  into 
the  standard  molds  of  multivariate  statistical  classification 
theory  usually  requires  the  estimation  of  high  dimensional  unknown 
distributions  and  unknown  a priori  probabilities  for  many  categories. 
Despite  great  theoretical  attention,  the  practical  modeling  and 
estimation  of  highly  multivariate  distributions  remains  essentially 
infeasible.

The rationale for multistage classification is that decomposing
the multiclass classification problem into several stages may
simplify the decision making process in practice. For example, a
priori (structural) knowledge concerning the physical and
biological relationships between categories and groups of
categories may be explicitly used to structure the skeleton of the
decision tree.

If, for an M-class, N-feature problem, processing has to be
limited to distributions of no more than K << N dimensions, then
different subsets of K features at the various stages promise
better results than a single "best" set of K features used in an
M-way "one-shot" decision. As shown in [Cover and Van Campenhout
(1976)], no non-exhaustive sequential K-feature selection
procedure in which the subsets are constrained to nest can be
optimal, even for jointly normal measurements.

Recall that, to avoid exhaustive search, suboptimal procedures
such as forward sequential, backward sequential and K individually
best features have been used together with various distance
measures or other criteria [see Kanal (1974)]. Cover and Van
Campenhout (1976) show that there exist reasonable class p.d.f.'s
for which these search strategies lead to the worst possible
K-feature subset, 1 < K < N. For the normal model with N = 4 they
give a simple numerical example in which the best 2-feature set
consists of the two individually worst measurements and neither of
the best 3-element or 2-element subsets includes the best subset
of lower cardinality. In the context of hierarchical classifiers,
it is easily proven [Kulkarni (1976)] that, at each node of a
given tree structure, the feature assignment which gives the node
decision with the least average error is not necessarily the
overall optimal assignment of features. In general, maximizing
classification performance at each individual node of a decision
tree does not result in an overall optimum decision tree.
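
The failure of nested selection can be illustrated with a small sketch. The error table below is purely hypothetical (the numbers are invented for illustration and are not the example given by Cover and Van Campenhout); it is monotone, in that adding a feature never increases the error, yet forward sequential selection misses the best 2-feature set, which consists of the two individually worst features. The names `ERROR`, `best_subset` and `forward_select` are assumptions of this sketch.

```python
from itertools import combinations

# Hypothetical error probabilities for every subset of four features.
# Illustrative numbers only; the table is monotone (a superset never
# has a larger error than any of its subsets).
ERROR = {
    (0,): 0.30, (1,): 0.32, (2,): 0.38, (3,): 0.40,
    (0, 1): 0.25, (0, 2): 0.28, (0, 3): 0.27,
    (1, 2): 0.29, (1, 3): 0.26, (2, 3): 0.10,
    (0, 1, 2): 0.05, (0, 1, 3): 0.20, (0, 2, 3): 0.09, (1, 2, 3): 0.08,
    (0, 1, 2, 3): 0.04,
}
N_FEATURES = 4


def best_subset(k):
    """Exhaustive search: the k-feature subset with the smallest error."""
    return min(combinations(range(N_FEATURES), k), key=lambda s: ERROR[s])


def forward_select(k):
    """Forward sequential selection: grow a nested subset one feature at a time."""
    chosen = ()
    while len(chosen) < k:
        candidates = [tuple(sorted(chosen + (f,)))
                      for f in range(N_FEATURES) if f not in chosen]
        chosen = min(candidates, key=lambda s: ERROR[s])
    return chosen


print(forward_select(2), ERROR[forward_select(2)])
print(best_subset(2), ERROR[best_subset(2)])
```

In this table the best single feature is 0, the best pair is (2, 3) and the best triple is (0, 1, 2), so no nested sequence of subsets can be optimal at every size.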

Thus measuring a large number of common features on a large
number of categories, and then using the above-mentioned popular
feature subset selection procedures to reduce the feature set for
subsequent design of a one-shot classifier, is unlikely to be
satisfactory. Furthermore, even with multistage decomposition,
straight optimizing of each node decision is unlikely to give good
results.

To avoid the first pitfall, the original definition and
extraction of features needs to be less arbitrary than is usually
the case in multivariate statistical classification.
Model-directed, data-confirmed, structural feature extraction
seems relevant in many situations and is briefly described later
in this paper. Also called for are systematic procedures for the
general, global design of hierarchical classifiers.

3.  MULTISTAGE DECISION SCHEMES - A SELECTIVE SURVEY


Procedures proposed in the literature for designing multistage
decision schemes consist of: (a) converting decision tables into
optimal decision trees [e.g. Knuth (1971), Meisel & Michalopoulos
(1973)]; (b) sequential classification rules; (c) hierarchical
classification methods. In sequential classification schemes, the
features are linearly ordered but, in most cases, no particular
order is imposed on the class labels. Any class label may be
accepted at any stage. Hierarchical classifiers are characterized
by the hierarchical ordering imposed both on the features and the
class labels. At each stage of the measurement process some
classes are rejected from further consideration as candidates for
the test sample's label.

The table conversion methods [Knuth (1971), Meisel &
Michalopoulos (1973), Pollack et al (1971), Reinwald & Soland
(1966), Winston (1969)] assume a decision table is given and do
not address the problems of feature selection and classification
error. For example, in Meisel & Michalopoulos (1973) the feature
space is already assumed to have been partitioned into the various
decision regions. Then a recursive procedure arranges a set of
piecewise constant boundaries in metric space so as to minimize
the average number of comparisons needed to classify a sample.
Thus the algorithm rearranges the order of the tests optimally
without affecting the misclassification rate.


Most of the parametric and nonparametric multiclass
classification schemes proposed in the literature may be described
within the framework of state space graph models and ordered
search strategies.

4.  STATE  SPACE  REPRESENTATIONS  AND  ORDERED  SEARCH 

A state-space representation (SSR) for a problem consists of
specifying a set $S$ of state descriptions; a set $I \subset S$ of
initial states; a set $S^* \subset S$ of goal states; and state
transition operators $T$ which, applied to a state $s_i$, produce
successor states $T(s_i)$. SSR is defined informally in [Nilsson
(1971)] and formally in [Stockman (1977)], in which an ordered
successor function $q: S \times N \to S$, with $N$ the set of
natural numbers, is used to define the successors of state $s_i$
as $Q(s_i) = \{ s_k : s_k = q(s_i, j) \text{ for some } j \}$.

A directed  graph  model  for  state-space  search  uses  nodes  of 
the  graph  to  represent  problem  states  and  associates  arcs  of  the 
graph  with  the  operators.  A state  space  representation  is  then 
viewed  as  an  implicit  definition  of  a directed  graph  and  a search 
strategy  becomes  a process  of  "expanding"  nodes  step  by  step  to 
obtain  successor  nodes,  thereby  making  explicit  a portion  of  the 
implicitly defined graph, in order to find a path from the initial
node  to  a goal  node. 

Breadth-first, depth-first, best-first, and heuristic search
are the descriptive names used in the A.I. literature [Winston
(1977)] for some of the widely used search strategies, which are
closely related to Dynamic Programming [Dreyfus & Law (1977)],
Backtrack Programming [Golomb & Baumert (1965)] and Branch and
Bound [Kohler & Steiglitz (1974)].

An ordered search strategy attempts to make the search
efficient by ranking the nodes available for expansion at each
stage of the search according to some merit criteria. Assuming
that a node has a finite number of successors, a general ordered
search procedure is defined by the following algorithm.

Let $E$ be the set of nodes which have been expanded (often
called the CLOSED list), $\bar{E}$ the set of nodes which are
candidates for expansion (the OPEN list) and $T(s)$ the set of
successors of node $s$. Initially set $E$ to the empty set and
$\bar{E}$ to the start node.

(1)  If $\bar{E}$ is empty, exit with failure.

(2)  Choose $s \in \bar{E}$ such that $s$ has best merit,
resolving ties arbitrarily.

(3)  If $s$ is a goal node, exit with success, obtaining the
solution path by tracing back through the pointers.

(4)  Remove $s$ from $\bar{E}$ and place $s$ in $E$. Expand node
$s$, generating all its successors; if there are no successors go
to (1).

(5)  For each successor $t \in T(s)$:
(a) if $t \notin E$ and $t \notin \bar{E}$, place $t$ in $\bar{E}$
with a pointer to its parent node $s$;
(b) if $t \in \bar{E}$ or $t \in E$, and the new merit is better
than the old merit, place $t$ in $\bar{E}$ with a pointer to $s$
and redefine the merit of $t$.

(6)  Go to (1).
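
The six steps above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function name `ordered_search`, the callback signatures, and the convention that a smaller merit value is better are all assumptions of the sketch. The OPEN list is kept in a heap with stale entries skipped, which is one of several reasonable ways to realize steps (2) and (5b).

```python
import heapq
from itertools import count


def ordered_search(start, successors, is_goal, merit):
    """Ordered search over an implicitly defined graph.

    successors(s) yields (t, arc_cost) pairs; merit(s, path_cost) returns
    a number, smaller meaning better. Returns (path, path_cost) or None.
    """
    cost = {start: 0.0}                  # cheapest path cost found so far
    parent = {start: None}               # back pointers
    best = {start: merit(start, 0.0)}    # current merit of each node
    tie = count()                        # tie-breaker so states are never compared
    open_heap = [(best[start], next(tie), start)]
    closed = set()

    while open_heap:                                   # step 1: fail if OPEN is empty
        m, _, s = heapq.heappop(open_heap)             # step 2: best merit first
        if s in closed or m > best[s]:
            continue                                   # stale heap entry, skip it
        if is_goal(s):                                 # step 3: trace back pointers
            goal, path = s, []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1], cost[goal]
        closed.add(s)                                  # step 4: move s to CLOSED
        for t, arc in successors(s):                   # step 5
            c_new = cost[s] + arc
            m_new = merit(t, c_new)
            if t not in best or m_new < best[t]:       # new node, or better merit
                cost[t], parent[t], best[t] = c_new, s, m_new
                closed.discard(t)                      # reopen if it was CLOSED
                heapq.heappush(open_heap, (m_new, next(tie), t))
    return None                                        # OPEN exhausted: failure


# Usage on a small hypothetical graph, with merit equal to path cost.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}
print(ordered_search("A", lambda s: graph[s], lambda s: s == "D",
                     lambda s, c: c))
```

Passing the path cost into `merit` lets the same routine serve for any evaluation function that combines cost so far with a state-dependent estimate.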

The merit of a node may be defined via an evaluation function
f(s) which uses selected features of a state s. A much used
evaluation function is f(s) = c(s) + h(s), where c(s) denotes the
sum of the costs of the operators leading to the generation of
state s from the initial state, and h(s) is a "heuristic"
component, frequently based on some measure of difference between
selected features of the state s and a goal state. Ordered search
with this function is referred to as the A* algorithm [Nilsson
(1971)]. In the directed graph representation, c(s) is the cost
incurred in going from the initial node to node s and h(s) is an
estimate of the cost of a path from s to the nearest goal node. At
a goal node s*, h(s*) is defined to be zero.

The above evaluation function may be generalized by defining
$f(s) = (1-\alpha)\,c(s) + \alpha\,h(s)$, $\alpha \in [0,1]$.
$\alpha = 0$ gives "uniform cost search", $\alpha = 1$ gives "pure
heuristic search", while $\alpha = 1/2$ gives the previously
defined "diagonal search" function used in the A* algorithm (up to
the constant factor 1/2, which does not affect the ranking of
nodes). Certain properties of A* were defined and investigated by
Hart et al (1968). Corrections [Hart et al (1972), Gelperin
(1977)] and generalizations of certain types [Pohl (1970), Harris
(1974), Martelli (1977)] have appeared subsequently.
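
As a small illustration of the weighted evaluation function, the sketch below builds f for a given weight and shows how the three named settings rank the same two nodes differently. The helper name `weighted_merit` and the node values are hypothetical, chosen only for this example.

```python
def weighted_merit(h, alpha):
    """Return f(s, c) = (1 - alpha) * c + alpha * h(s) for alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return lambda s, c: (1.0 - alpha) * c + alpha * h(s)


# Two hypothetical nodes: p is costly to reach but looks close to a goal,
# q is cheap to reach but looks far from a goal.
h_values = {"p": 1.0, "q": 5.0}
path_cost = {"p": 4.0, "q": 1.0}

for name, alpha in [("uniform cost", 0.0), ("diagonal", 0.5), ("pure heuristic", 1.0)]:
    f = weighted_merit(h_values.get, alpha)
    preferred = min(path_cost, key=lambda s: f(s, path_cost[s]))
    print(name, preferred)
```

With these values uniform cost search expands q first, while the diagonal and pure heuristic settings expand p first.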

A $\delta$-graph has been defined as a graph with positive arc
costs [Nilsson (1971)] and, more generally [Vanderbrug (1977)], so
as to allow finitely many arcs of zero cost. In either case, the
following properties hold:

(a)  an ordered search strategy is complete, i.e., finds a
solution whenever it exists, iff $\alpha \in [0,1)$;

(b)  if a heuristic $h$ satisfies the lower bound condition
$h(s) \le h_p(s)$ for all nodes $s$, where $h_p$ is the perfect
heuristic (true remaining cost to the nearest goal), and iff
$\alpha \in [0,1/2]$, then search with $h$ is admissible, i.e.,
terminates with a minimum cost solution whenever one exists.
Underestimating, at each stage, the distance remaining to the goal
thereby underestimates the total path length. Since the actual
cost along some completed non-minimum cost path cannot be less
than an underestimate of the cost along an incomplete minimum cost
path, the process of repeatedly extending the path which thus far
has lowest underestimated cost will guarantee that the least cost
path is found. Of course, the closer $h$ is to $h_p$, the more
efficient will be the search.

(c)  to compare two admissible strategies using the ordered
search algorithm, the concept of optimality of a search strategy
is introduced. Note that admissibility refers to optimality of the
solution. Let $h_1$ and $h_2$ be heuristic functions satisfying
$h_2(s) < h_1(s) \le h_p(s)$ for all non-goal nodes $s$, and let
$\alpha \in [0,1/2]$. Then for all $\delta$-graphs containing a
minimum cost solution, search with $h_2$ expands at least as many
nodes as are expanded by search with $h_1$. See also Gelperin
(1977).

(d)  If, for any two nodes $s$ and $s'$ which are connected by a
path of cost $c(s,s')$, it is assumed that
$h(s) - h(s') \le c(s,s')$, then $h$ is consistent. If $h$ is
consistent then, for $\alpha = 1/2$, the ordered search algorithm
never has to reopen a closed node, i.e., when it expands a node it
has already found a minimum cost path to that node.

Vanderbrug (1977) substitutes easily followed geometric proofs for
the algebraic proofs given previously for the above properties.
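
The lower bound condition of property (b) and the consistency condition of property (d) are easy to check on a small explicit graph. The sketch below is illustrative only; the function names and the example graph are assumptions. It checks consistency arc by arc, which suffices because summing the arc inequalities along a path gives the path inequality.

```python
def is_lower_bound(h, h_perfect):
    """Property (b): h(s) <= h_p(s) for every node s."""
    return all(h[s] <= h_perfect[s] for s in h_perfect)


def is_consistent(h, arcs):
    """Property (d), checked on each arc (s, t, cost): h(s) - h(t) <= cost."""
    return all(h[s] - h[t] <= cost for s, t, cost in arcs)


# Hypothetical graph with goal D; h_perfect holds the true remaining costs.
arcs = [("A", "B", 1), ("A", "C", 4), ("B", "C", 1), ("B", "D", 5), ("C", "D", 1)]
h_perfect = {"A": 3, "B": 2, "C": 1, "D": 0}

h_good = {"A": 2, "B": 2, "C": 1, "D": 0}   # lower bound and consistent
h_odd = {"A": 3, "B": 0, "C": 1, "D": 0}    # lower bound but not consistent

print(is_lower_bound(h_good, h_perfect), is_consistent(h_good, arcs))
print(is_lower_bound(h_odd, h_perfect), is_consistent(h_odd, arcs))
```

The second heuristic shows that the lower bound condition alone does not imply consistency: it drops by 3 across the arc from A to B, whose cost is only 1.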

The next section illustrates how a generalization of the above
approach serves as a theoretical model for multistage, multiclass
classification [Kanal & Kulkarni (1976), Kulkarni (1976)].

5.  STATE-SPACE GRAPHS FOR MULTICLASS CLASSIFICATION

In the state-space graph $G = \{S, E, F, W, c, r\}$ let a state
$s \in S$ be a tuple $\{F_s, W_s\}$, where $F_s$ is a subset of
the total feature set $F$, and $W_s$ is a subset of the total set
of class labels $W$. $W_s$ denotes the possible classifications
that can be made on any path in the graph passing through the
state $s$. An edge $e \in E$ represents the action of measuring a
particular feature or set of features, and has an associated
measurement cost determined by $c$, a non-negative real valued
cost function. For a goal state $s^*$, $F_{s^*} = \emptyset$
(null) and $W_{s^*}$ contains one or zero (X = reject) class
labels. At a goal state, a misclassification risk $r(s^*)$ is
incurred. The initial or start node of $G$ contains all the
possible class labels including X, the reject class.

If  N(s*)  is  the  set  of  nodes  on  a path  to  a goal  node  s*, 
c(N(s*))  is  the  sum  of  arc  costs  along  that  path  and  r(s*)  the  risk 
at  s*,  then  the  total  cost  of  making  the  decision  s*  is 
f(s*) = c(N(s*)) + r(s*). Two possible broad categories of
classification schemes are:

(i)  The risk $r(s^*)$ depends only on the features $x_s$ measured
along the path from the initial state to goal $s^*$, and is
denoted by $r(s^*/x_s)$. An "S-admissible" strategy terminates at
that goal $s^*$ in $G$ for which
$f(s^*) = c(N(s^*)) + r(s^*/x_s)$ is minimum;

(ii)  The risk $r(s^*)$ is a function of all the measurements, not
just those on the path to $s^*$. If $x^{(k)}$ denotes the features
observed until stage $k$, then the risk is $r(s^*/x^{(k)})$ and it
could change as more features are observed until it reaches the
value $r(s^*/x)$. A "B-admissible" strategy finds that (category)
node $s^*$ for which $f(s^*) = c(N(s^*)) + r(s^*/x)$ is minimum; a
Bayes optimal strategy results when arc costs are set to zero. In
certain cases, B-admissible strategies to find the optimal
category can be formulated without having to observe the total set
of features.

Ordered search algorithms for S-admissible and B-admissible
strategies for multiclass pattern classification can be realized
by defining an evaluation function for node s as
f(s) = c(s) + h(s) + l(s), where c(s) is the arc cost from the
starting node to node s, h(s) is an estimate of the arc cost from
s to a goal node accessible from s, and l(s) is an estimate of the
risk of a goal node accessible from s.

S-Admissible  Ordered  Search  Algorithm 

This algorithm, called Algorithm S, differs from algorithm A*
described earlier in the additional term used to estimate the risk
at a goal. Let

$$
l(s) = \begin{cases}
r(s/x_s) & \text{if } s \text{ is a goal node} \\
\min_{j \in W_s} \left[ \min_{Y \in F(j) - F(s)} r(j/x_s, Y) \right] & \text{otherwise}
\end{cases}
$$
Here $F(j)$ denotes the measurement space spanned by the set of
features measured on the path to node $j$, and
$Y \in F(j) - F(s)$ is a vector in the complement space of $F(s)$
with respect to $F(j)$. Also

$$
h(s) \begin{cases}
= 0 & \text{if } s \text{ is a goal node} \\
\le \min_{j \in W_s} c(s,j) & \text{otherwise}
\end{cases}
$$

where $c(s,j)$ is the sum of arc costs from $s$ to $j$. Closing a
node implies observing a particular feature set associated with
the node. An S-admissible strategy terminates when it first puts a
goal node on the CLOSED list.

B-Admissible Ordered Search Algorithm

If the goal risk were to change with additional measurements
taken on other paths, then the optimality of the first goal put
into CLOSED cannot be guaranteed, because the additional
observations may increase that goal's risk while decreasing the
risk of some other goal. Algorithm B, which gives B-admissible
search strategies, uses an upper bounding function on the risk and
a lower bounding function analogous to that used by Algorithm S.

Now the term $l(s)$ which estimates the risk is defined by

$$
l(s) = \begin{cases}
\min_{Y \in F - F^{(k)}} r(s/x^{(k)}, Y) & \text{for } s \text{ a goal node} \\
\min_{j \in W_s} \left[ \min_{Y \in F - F^{(k)}} r(j/x^{(k)}, Y) \right] & \text{for } s \text{ a non-goal node}
\end{cases}
$$

where $x^{(k)}$ denotes the vector of observations on a test
sample after $k$ stages of the algorithm and
$Y \in F - F^{(k)}$ denotes a random vector in the complement
space with respect to the total feature set $F$. In Algorithm B,
for each goal node $s$ put on CLOSED, an upper bounding function
$b(s)$ is computed, with $b(s) = c(s) + u(s/x^{(k)})$, where $u$
satisfies the inequality

$$
u(s/x^{(k)}) \ge \max_{Y \in F - F^{(k)}} r(s/x^{(k)}, Y).
$$


If $s^*$ is the goal node with minimum $b(s)$, and if
$b(s^*) \le f(s)$ for all nodes $s$ in the union of OPEN and
CLOSED, then the algorithm exits with $s^*$ as the B-optimal
category. Algorithm B does not terminate when a goal node is put
in the CLOSED list. Instead, at each iteration after a goal node
is put on CLOSED, Algorithm B checks if there is some goal node on
CLOSED such that the upper bound on its cost, for any possible
future measurement sequence, is less than the lower bound on the
cost of any other goal node, either in CLOSED or below some OPEN
node in the graph. If so, the algorithm terminates with that goal
node as the decision.
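
The termination test of Algorithm B can be sketched as a comparison of bounds. This is an interpretation under stated assumptions, not the authors' procedure: the function name `algorithm_b_decision` and the dictionary layout are hypothetical, and the comparison is made against OPEN nodes and the other goal nodes on CLOSED, following the verbal description above.

```python
def algorithm_b_decision(upper, lower):
    """Return the goal to decide on, or None if the search must continue.

    upper: {goal: b(goal)} upper bounds for goal nodes already on CLOSED.
    lower: {node: f(node)} lower bounds for OPEN nodes and for goal nodes
           on CLOSED (the chosen goal's own entry is ignored).
    """
    if not upper:
        return None
    best = min(upper, key=upper.get)             # goal with minimum b(s)
    others = (f for node, f in lower.items() if node != best)
    if all(upper[best] <= f for f in others):    # b(s*) <= f(s) for all others
        return best
    return None


# Hypothetical bounds: goals g1 and g2 are on CLOSED, node n7 is on OPEN.
upper = {"g1": 2.0, "g2": 3.5}
print(algorithm_b_decision(upper, {"g1": 1.5, "g2": 2.5, "n7": 2.2}))
print(algorithm_b_decision(upper, {"g1": 1.5, "g2": 2.5, "n7": 1.8}))
```

In the second call the OPEN node n7 could still lead to a goal costing less than 2.0, so no decision is made and further measurements would be taken.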

For certain parametric probability functions it may be possible
to derive tight bounds on the risk of the goal accessible from a
node s, as is shown in [Kulkarni & Kanal (1978)], where proofs for
various properties of the algorithms appear. That paper also shows
by example how B-admissible search of a state-space graph for
leukocyte classification improves on the usual decision tree
approach, in which, after reaching a terminal, no logical strategy
exists for reconsidering classes possibly discarded earlier.

6.  NEAREST NEIGHBOR CLASSIFICATION

For the nonparametric case, nearest neighbor (NN) classification
rules are receiving increasing attention; see e.g. [Dasarathy &
Sheela (1977), Kanal (1974)]. Given an unlabelled (test) random
sample $X$, the search for its nearest neighbor among a set of
labelled (design) samples can be modelled as searching a state
space graph $G(S, E, X, D, c, d)$. Here $D$ refers to the total
set of design samples $Y_1, Y_2, \ldots, Y_n$, $S$ is the set of
states (nodes) and each state $s \in S$ is a tuple $(F_s, D_s)$,
where $F_s$, defined by the features measured on the path to $s$,
is a subspace of the feature space $F$. $D_s$ denotes a subset of
$D$; for a goal state $|D_s| = 1$, i.e., it consists of a single
labelled sample. At a goal node, $d$ is the distance or other
similarity measure between the test sample $X$ and the design
sample represented by the goal node. $E$ and $c$ refer, as before,
to the edge set and cost function.

Depending on the cost functions defined for the edges and goal
nodes, various NN rules can be defined. For example, if $c(s^*)$
is the sum of arc costs on the path to goal node $s^*$, $Y_{s^*}$
is the labelled sample represented by $s^*$, and
$d(X, Y_{s^*}; F_{s^*})$ is the distance measured in the subspace
$F_{s^*} \subset F$, then one NN procedure [Kulkarni (1976)]
defines the optimal goal $s^*$ to be the one for which
$f(s^*) = c(s^*) + d(X, Y_{s^*}; F_{s^*})$ has minimum value. Note
that in this procedure, the distance is computed only in the
subspace defined by the features measured in getting to $s^*$.
Such a procedure may be suitable when certain features are
significant while others are irrelevant to defining the similarity
between a random test sample and a labelled sample from a
particular class.

An alternative would be not to assume that some features are
unimportant for the distance computation, while retaining the
tradeoff between feature measurement cost (arc costs) on the path
to the goal and the risk of not finding the true nearest neighbor.
In this case, the optimal goal $s^*$ minimizes
$f(s^*) = c(s^*) + d(X, Y_{s^*}; F)$.

The conventional NN schemes are special cases of the above,
obtained by setting arc costs to zero, i.e., now
$f(s^*) = d(X, Y_{s^*}; F)$. The above three general NN procedures
can be implemented as state space searches giving S-admissible,
B-admissible and B-admissible with zero arc costs procedures,
respectively. Corresponding to the earlier problem of finding
upper and lower bounds on the risk in the parametric cases is the
problem here of computing lower and upper bounds on the d( )
measure. Bounds for various metric and nonmetric similarity
measures are derived in [Kulkarni (1976), Kulkarni & Kanal
(1978)]. The extension to K-NN calculations is immediate.

A branch and bound algorithm for computing nearest neighbors by
Fukunaga and Narendra (1975) is a special case of B-admissible
search with zero arc costs. First the prototype samples are
hierarchically decomposed into disjoint subsets, represented by a
tree structure; any clustering technique can be used, with
computational efficiency, rather than meaningful groupings, being
the main criterion. Two rules are used in the algorithm. Let $S_p$
be the set of samples associated with node $p$ in the tree. Let
$M_p$ be the mean of $S_p$ and let $\gamma_p$ be
$\max \{ d(X_i, M_p) \mid X_i \in S_p \}$. Let $B$ be the distance
to $X$ of the current nearest neighbor. Then Rule 1 in the branch
and bound algorithm is: discard every $X_i \in S_p$ as the
potential nearest neighbor of $X$ if
$B + \gamma_p \le d(X, M_p)$. Rule 2 is: discard $X_i \in S_p$ if
$B + d(X_i, M_p) < d(X, M_p)$. Both rules follow from the triangle
inequality, $d(X, X_i) \ge d(X, M_p) - d(X_i, M_p)$. Several
computational experiments are reported.
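
The two rules can be sketched for the simplest case of a one-level decomposition (a flat list of disjoint subsets rather than a full tree). This simplification, the Euclidean distance, and the names `summarize` and `bb_nearest` are assumptions of the sketch; it is not the published algorithm in full.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def summarize(subsets):
    """For each subset S_p store its points, mean M_p, the distances
    d(X_i, M_p), and their maximum gamma_p."""
    summary = []
    for pts in subsets:
        mean = tuple(sum(col) / len(pts) for col in zip(*pts))
        d_mean = [euclid(p, mean) for p in pts]
        summary.append((pts, mean, d_mean, max(d_mean)))
    return summary


def bb_nearest(x, summary):
    """Return (nearest point, its distance B, number of distances computed)."""
    best, B = None, math.inf
    n_dist = len(summary)                      # distances from x to the means
    ranked = sorted(((euclid(x, mean), pts, d_mean, gamma)
                     for pts, mean, d_mean, gamma in summary),
                    key=lambda r: r[0])        # visit the closest subsets first
    for d_xm, pts, d_mean, gamma in ranked:
        if B + gamma <= d_xm:                  # Rule 1: discard the whole subset
            continue
        for p, d_pm in zip(pts, d_mean):
            if B + d_pm < d_xm:                # Rule 2: discard this sample
                continue
            d = euclid(x, p)
            n_dist += 1
            if d < B:
                best, B = p, d
    return best, B, n_dist


# Usage with two hypothetical, well separated subsets.
summary = summarize([[(0.0, 0.0), (0.1, 0.2)], [(5.0, 5.0), (5.2, 4.9)]])
print(bb_nearest((0.2, 0.1), summary))
```

In the usage example the second subset is rejected by Rule 1 without computing any distance to its members.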


For example, the preprocessing step of dividing 1000 bivariate
gaussian pseudorandom samples into 27 final subgroups, by
successively applying the three-means algorithm, took 12,000
distance computations. When the branch and bound algorithm was
applied to an additional 1000 samples, on average 61 distance
calculations were required and no test sample took more than 87.
For 3000 samples uniformly distributed in 8-dimensional space, the
number of groups was 236, and on average a NN search took 431
distance calculations. Branch and bound algorithms for feature
subset selection and clustering appear in [Fukunaga and Narendra
(1976)].

To aid K-NN computation, Friedman, Baskett & Shustek (1973)
sort the labelled samples on the values of one of the $\ell$
coordinates. For each test point, the $n$ prototypes are examined
in the order of their projected distance $d_p$ from the test point
on the sorted coordinate. When $d_p > d_K$, the $\ell$-space
distance to the Kth closest point of those already examined, no
more prototypes need be examined. Best behavior is obtained by
sorting on the values of each of the coordinates independently,
and then selecting the axis with the smallest projected local
density, i.e., the largest spread. Sparsity is calculated over a
set of $n_d$ points centered on the test sample, using for the
value of $n_d$ the expected number of distance calculations under
a uniform distribution.
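
The stopping rule of the sorted-projection method can be sketched as follows. The sketch sorts on a single fixed axis and re-sorts on every call for brevity (in practice the sort is preprocessing done once); the adaptive choice of the sparsest axis is omitted, and the function name and signature are assumptions.

```python
import bisect
import heapq
import math


def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_sorted_projection(x, prototypes, k, axis=0):
    """Return ([(distance, point), ...] for the k nearest, distances computed)."""
    pts = sorted(prototypes, key=lambda p: p[axis])   # preprocessing step
    keys = [p[axis] for p in pts]
    lo = bisect.bisect_left(keys, x[axis]) - 1        # next candidate below x
    hi = lo + 1                                       # next candidate above x
    heap = []                                         # (-distance, point), k best so far
    n_dist = 0
    while lo >= 0 or hi < len(pts):
        d_lo = x[axis] - keys[lo] if lo >= 0 else math.inf
        d_hi = keys[hi] - x[axis] if hi < len(pts) else math.inf
        if d_lo <= d_hi:                              # smaller projected distance first
            d_proj, p, lo = d_lo, pts[lo], lo - 1
        else:
            d_proj, p, hi = d_hi, pts[hi], hi + 1
        if len(heap) == k and d_proj > -heap[0][0]:
            break                                     # projected distance exceeds Kth best
        d = euclid(x, p)
        n_dist += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))
    return sorted((-nd, p) for nd, p in heap), n_dist


pts = [(0.1, 0.9), (0.4, 0.4), (0.5, 0.5), (0.9, 0.1), (0.45, 0.6)]
print(knn_sorted_projection((0.5, 0.45), pts, k=2))
```

The early exit is safe because prototypes are visited in nondecreasing order of projected distance, and the full distance to a prototype is never smaller than its projected distance.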

Assuming a uniform distribution and assuming that sorting is
performed on one coordinate at random, the expected number of
distance calculations is

$$
E[n_d] \le \pi^{-1/2} \left[ K \ell \, \Gamma(\ell/2) \right]^{1/\ell} (2n)^{1 - 1/\ell}.
$$

Preprocessing is proportional to $n \ell \log n$.

Simulation experiments using the distributions uniform on the
unit square, bivariate normal and bivariate Cauchy showed that the
analytical expression provides a close upper bound for actual
average  performance.  For  example,  for  t ^ 1,  with  n “ 1000, 

NN  search  would  require  an  (upper  bound)  average  of  112 
distance  calculations. 

The efficiency of this K-NN algorithm decreases slightly with K
and more rapidly with the number of dimensions $\ell$. For
example, for $\ell = 8$ and $n = 1000$, with $K = 1$, the number
of distance calculations is upper bounded by 60% of the number for
a brute force calculation. If $N$ denotes the total number of test
samples, a rough approximation of the breakeven point for using
this procedure over the brute force method is

$$
N \approx \frac{n \log n}{n - E[n_d]}.
$$


Kulkarni (1976), using examples from Euclidean measure and
similarity measures for binary vectors, showed that one can
relatively inexpensively obtain bounds which allow ordered search
S-admissible and B-admissible algorithms to reduce the measurement
cost or the number of distance computations needed to classify a
test sample. Computational results similar to those of the branch
and bound method, which is subsumed, could be anticipated.
Analysis of the expected number of distance calculations remains
to be done.

Recent theoretical results on K-NN performance bounds and
error-reject estimation further motivate interest in NN search
algorithms.

Given an unlabelled sample $X$ whose label
$\theta \in \{1, 2, \ldots, M\}$ is to be decided, and given a
finite set of $n$ labelled samples
$(X_1, \theta_1), (X_2, \theta_2), \ldots, (X_n, \theta_n)$, a
rule is called K-local if the decision on $\theta$ depends only on
those pairs $(X_i, \theta_i)$ for which $X_i$ is one of the K-NN
of $X$. Let $L_n$ denote the K-local rule error probability, and
let $L_n^R$, $L_n^D$ and $L_n^H$ denote respectively the
resubstitution error estimate, the deleted ("leave-one-out")
estimate and the hold out estimate. Rogers and Wagner (1978) prove
that $E[(L_n - L_n^D)^2]$ is bounded by $A/n$, where $A$ is an
explicitly given small constant depending only on K and the number
of categories M. The bound does not depend on the number of
dimensions, $\ell$, which suggests that local rules exchange K for
$\ell$ [see Cover & Wagner (1976) on this and other topics in
non-parametric discrimination, finite memory learning and pattern
complexity]. Recently Devroye and Wagner (1977a,b) presented
distribution-free bounds for
$\mathrm{Prob}\{ |\hat{L}_n - L_n| \ge \epsilon \}$, where
$\hat{L}_n$ stands for error estimates of the kind defined above.

A modified K-NN rule is obtained by allowing rejects. The
modified rule, denoted as the (K,K')-NN rule, makes the same
decision as the K-NN rule whenever one or more labels receive at
least K' votes from among the K-NN of X; otherwise the test sample
is rejected. Consider the error and reject rates of the (K,K')-NN
rule. Reject rates may be obtained from unlabelled samples. Devijver
(1976) presents a distribution-free relationship between error rates
and reject rates, involving the (K+1)st NN,
which allows the error rate to be obtained without having to label
test samples and count errors. The price is that in addition to
the K-NN's, the (K+1)st NN will have to be found. The exchange
is between the labels of the test samples and the label of the
(K+1)st NN. Devijver's non-parametric results seem to improve
the prospect for practical estimation of error rates from unlabelled
samples. As noted in [Kanal (1974)], experience in the parametric
case suggests that the error rate predicted from the empirical
reject rate can be quite inaccurate if the model assumed in
designing the classifier was inaccurate.
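
The reject option can be sketched in a few lines of Python. This is only
an illustration of the (K,K')-NN decision and of counting rejects on
unlabelled samples; it does not implement Devijver's relationship, and the
names, the `None` reject sentinel and the one-dimensional data are
assumptions of this example.

```python
from collections import Counter

REJECT = None  # hypothetical sentinel returned when the sample is rejected


def knn_with_reject(train, x, k, k_prime):
    """(K,K')-NN rule: the K-NN decision if the winning label has >= K' votes."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label if count >= k_prime else REJECT


def reject_rate(train, unlabelled, k, k_prime):
    """Fraction of test points rejected; no test labels are needed."""
    rejected = sum(knn_with_reject(train, x, k, k_prime) is REJECT
                   for x in unlabelled)
    return rejected / len(unlabelled)


train = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b"), (4.0, "b")]
print(knn_with_reject(train, 3.2, 3, 3))   # unanimous neighbours: a decision is made
print(knn_with_reject(train, 1.4, 3, 3))   # split vote: rejected
print(reject_rate(train, [3.2, 1.4], 3, 3))
```

Setting K' to a bare majority recovers the ordinary K-NN decision, so K'
acts as the knob that trades errors for rejects.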

Another theoretical connection between K-NN's and error rates
is developed by Tebbe (1976). He shows that the kth coefficient
in an orthogonal Legendre series expansion of the Bayes risk
function for the two-class, zero-one loss case can be estimated from
an expectation defined in terms of the (k+2)-NN's of the patterns in
a random sample.


7. STRUCTURAL FEATURE EXTRACTION, AND
LIMITATIONS OF SYNTACTIC PATTERN RECOGNITION

Feature extraction techniques based on Fourier and other
integral transforms, matrix methods and linear operator theory
abound in pattern recognition theory. Non-linear feature
extraction transformations are also being investigated. For example,
in [Starks and de Figueiredo (1977)] the transformation attempts to
preserve graph theoretic structures such as the minimum spanning tree,
maximally complete subgraphs, inconsistent edges and diameter
edges derived from data points.

For complex problems such as detecting and identifying
structured elements in noisy biomedical waveforms or aerial photographs,
most of these feature extraction methods provide a round-about,
inefficient way of recapturing structures of interest apparent in
portions of the scene. Future electro-optical and biologically
motivated implementations may change this appraisal.

We would like to efficiently extract primitive structural
elements (morphs) which are perceptually higher level objects than
scalar measurements and use a variety of relationships among them
in describing and recognizing patterns. Syntactic pattern
recognition [see Fu (1974)] was presumably intended to overcome some of
the limitations of statistical pattern recognition. However,
syntactic pattern recognition has also viewed primitive extraction as
preprocessing. This requires that all possibilities be considered
in all regions of the data. Also, separating the extraction of
morphs from the analysis of structure excludes each process from
information available to the other.

Formal language theory had addressed some problems of
ambiguity and error. Earley's parser [Earley (1970)] was developed
to handle ambiguous context-free grammars (CFG). Aho and Peterson
(1972) showed how to model errors of insertion, deletion and
mutation of terminal symbols by adding productions to a grammar, and
developed a minimum-distance error-correcting parser from the
Earley parser. Lyons (1974) developed a least-errors parser by
extending the Earley parsing algorithm rather than the grammar.
Assuming syntactic models for pattern noise and deformation are
available (unlikely to be true for real data), Lu and Fu (1976)
added probabilities to the techniques of Aho and Peterson (1972)
and used Earley's algorithm to develop a maximum likelihood parse
for any string over the terminal vocabulary.
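
The error model underlying these parsers (insertion, deletion and mutation
of terminal symbols) can be illustrated on its own, without a grammar. The
sketch below computes the minimum number of such errors that turns one
terminal string into another by the standard dynamic program for edit
distance. It is not the Aho and Peterson parser, which minimizes the
distance to any string of a language; the function name is an assumption
of this example.

```python
def min_errors(source, target):
    """Fewest insertions, deletions and mutations turning source into target."""
    # prev[j] = cost of turning the processed prefix of source into target[:j]
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into the empty string: i deletions
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,             # delete s from the source
                curr[j - 1] + 1,         # insert t into the source
                prev[j - 1] + (s != t),  # mutate s into t, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(min_errors("abcab", "abab"))  # one spurious terminal 'c' must be deleted
```

An error-correcting parser applies the same three error types, but takes
the minimum over all sentences the grammar can generate.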

While partially addressing ambiguity of analysis and
description, syntactic pattern recognition has completely ignored ambiguous
detection of primitives. If very small low level primitives are
used, e.g., line or polynomial segments derived from a piecewise
functional approximation of an entire waveform, ambiguity of
detection may be avoided but other problems are created. The
strings become too long, which makes parsing economy critical, and
the segmentations are not anthropomorphic, which creates a need for
grammatical inference.

Casting pattern analysis and description directly into the
mold available from formal language theory, syntactic pattern
recognition took on the burden of the concatenation relation and
left-right parsing. Also, in general, a one-directional, i.e., strictly
top-down or strictly bottom-up, parse procedure has been adopted.
Strictly bottom-up methods [e.g. Fu (1974), Ledley (1966), Horowitz
(1975), Pavlidis (1976)] do not take advantage of a priori
knowledge during segmentation, while strictly top-down methods
[e.g. Harlow and Eisenbeis (1973), Stockman, Kanal & Kyle (1976),
Walker (1974)] can inefficiently generate hypotheses that are in
no way related to a given instance of data.

Some desired objectives for a structural analysis procedure
are: (1) the pattern analysis should be able to proceed in a
bidirectional data-directed and model-directed manner with primitive
extraction and structural analysis coupled; (2) the analysis should
not be restricted to a canonical left-right scan of the entire data
but should be non-left-right, selective, and focus on prominent morphs;
(3) multiple and ambiguous interpretations should be developed on
a best-first basis, with ambiguity permitted in both segmentation
and structural analysis.

For speech recognition, Miller (1973) and Reason (1976) have
used context-free grammar models with non-left-right analysis, and
they and others [Reddy (1973), Walker (1974)] allowed ambiguous
detection of vocabulary terminals. A structural analysis paradigm
which realizes the above enumerated objectives, not satisfactorily
addressed previously, is developed in [Stockman (1977a)], which also
presents the implementation of the paradigm in an extensively tested
waveform parsing system. Recently the approach has also been used
for the recognition of objects in imagery [Stockman (1977b)]. This
non-directional approach begins by identifying certain prominent
primitive components which can be reliably extracted and then uses
model-directed, data-confirmed search for the remaining pattern
structure. The theoretical concepts underlying the algorithm are
mentioned next.

8. AND/OR GRAPHS (AG'S), STATE-SPACE REPRESENTATION
(SSR) AND NON-DIRECTIONAL ANALYSIS (NDA) IN FEATURE EXTRACTION

A Problem-Reduction representation (PRR) recursively tries to
solve a problem by transforming it into several simpler equivalents,
any one of which, if solved, solves the problem, or by transforming it
into several subproblems, all of which, if solved, solve the original
problem. Using nodes to represent problems and subproblems, PRR's
are modeled by AND/OR graphs (AG's) in which equivalent problems
are represented by OR nodes and subproblems of a node are
represented by AND nodes. The edges leading to the AND nodes are tied
together with an arc to indicate that all of the AND descendants
of a node must be solved in order to declare their parent solved.
By inserting dummy OR nodes, mixed AND/OR nodes can be represented
by combinations of pure OR nodes and pure AND nodes.

Solving a problem at the root node of an AG involves searching
(making explicit) portions of the AG to find primitive problems
whose solution allows the original problem to be declared solved,
or showing that no such solvable primitive problems are present in
the AG, in which case an "unsolvable" declaration is passed back
up the AG. Solvable primitive problems are represented by terminal
nodes in the AG. Costs of transforming problems into equivalent
problems and subproblems may be associated with the edges of an AG.

An informal presentation of PRR and AG's is given in Nilsson
(1971). Hall (1973) showed the equivalence of a CFG to a finite
AG, and Chang and Slagle (1971) gave one approach to converting
PRR to SSR so that the A* ordered search algorithm can be used to
find optimal solution graphs in PRR according to a sum-of-edge-costs
criterion. Vanderbrug and Minker (1975) gave a formal
treatment of PRR and showed an approach to bidirectionally relating AG
search and state-space search. A different formal treatment and
a different conversion between PRR and SSR is given in [Stockman
(1977a)]. Being motivated by non-directional structural feature
extraction, this is the one of interest here.
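
To make the correspondence between a CFG and a finite AG concrete, here is
a minimal Python sketch of one simple way to set it up; it is not claimed
to be Hall's construction. Each nonterminal becomes an OR node over its
alternative productions, and each production becomes an AND node over the
symbols of its right-hand side. In this sketch a node carries the type of
the choice among its own successors, the usual textbook convention; the
function names, the `head#n` node labels and the toy grammar are
assumptions of this example.

```python
def cfg_to_and_or(grammar):
    """Map {nonterminal: [right-hand sides]} to {node: (type, successors)}."""
    graph = {}
    for head, alternatives in grammar.items():
        option_nodes = []
        for n, rhs in enumerate(alternatives):
            option = f"{head}#{n}"              # one node per production
            graph[option] = ("AND", list(rhs))  # all symbols must be found
            option_nodes.append(option)
        graph[head] = ("OR", option_nodes)      # any one production suffices
    return graph


def terminals(graph):
    """Symbols with no expansion: the primitive problems of the graph."""
    return {s for _, succ in graph.values() for s in succ if s not in graph}


# Toy grammar (hypothetical): S -> A B | c,  A -> a,  B -> b
grammar = {"S": [("A", "B"), ("c",)], "A": [("a",)], "B": [("b",)]}
graph = cfg_to_and_or(grammar)
print(graph["S"])
print(sorted(terminals(graph)))
```

A grammar without recursion gives an acyclic graph; recursive grammars
need cycles or an unbounded unfolding, which this sketch does not address.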

A PRR is a 5-tuple $\{P, r, t, u, B\}$ where $P = \{P_i\}$ is an enumerable
set of problem descriptions, $B \subset P$ is a set of initial problems
only one of which need be solved, $r$ is the ordered successor
function $r: P \times N \to P$, where $N$ is the set of natural numbers, $t$ is
the node type function $t: P \to \{\mathrm{AND}, \mathrm{OR}\}$ and $u$ is the node solution
function $u: P \to \{\mathrm{Live}, \mathrm{Solved}, \mathrm{Dead}\}$. A problem is live when it is
not known to be solved or dead, i.e., unsolvable. Solvability of
OR nodes implies solvability of their parent, while unsolvability
of any AND node implies unsolvability of its parent. Let $R(P_i)$
denote the set of all successors of problem $P_i$. A problem
$i \in$ PRR is solved iff (1) $u(i) = \mathrm{Solved}$; or (2) $u(i) = \mathrm{Live}$ and
there exists a successor $k \in R(i)$ such that $u(k) = \mathrm{Solved}$ and $t(k) = \mathrm{OR}$; or
(3) $u(i) = \mathrm{Live}$, and for every successor $k \in R(i)$, $u(k) = \mathrm{Solved}$ and
$t(k) = \mathrm{AND}$. PRR has a solution iff some problem $i \in B$ is solved.
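
The three clauses of this definition translate almost directly into code.
The sketch below is one possible reading, under two assumptions that are
not in the text: "u(k) = Solved" for a successor is evaluated recursively
(so solved status propagates upward through several levels), and clause
(3) requires at least one successor. The dictionaries `u`, `t` and `R`
stand for the functions u, t and the successor sets R(i); the node names
are hypothetical.

```python
def is_solved(i, u, t, R):
    """Apply clauses (1)-(3) of the definition to problem i."""
    if u[i] == "Solved":                                   # clause (1)
        return True
    if u[i] == "Dead":                                     # known unsolvable
        return False
    successors = R.get(i, [])
    # Clause (2): some OR successor is solved.
    if any(t[k] == "OR" and is_solved(k, u, t, R) for k in successors):
        return True
    # Clause (3): every successor is an AND node and is solved.
    return bool(successors) and all(
        t[k] == "AND" and is_solved(k, u, t, R) for k in successors)


def has_solution(B, u, t, R):
    """The PRR has a solution iff some initial problem is solved."""
    return any(is_solved(i, u, t, R) for i in B)


# Hypothetical example: root has two equivalent forms a and b (OR successors);
# each splits into two subproblems (AND successors).
R = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
t = {"root": "OR", "a": "OR", "b": "OR",
     "a1": "AND", "a2": "AND", "b1": "AND", "b2": "AND"}
u = {"root": "Live", "a": "Live", "b": "Live",
     "a1": "Solved", "a2": "Dead", "b1": "Solved", "b2": "Solved"}
print(has_solution(["root"], u, t, R))
```

Here alternative a fails because one of its subproblems is dead, while b
succeeds, so the root is solved through b.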

Every OR successor of a problem (node) is called a primary
successor but only the first AND successor of a node is a primary
successor. There is no point in examining any AND alternative if a
previous AND subproblem is unsolvable. Hence AND alternatives
are considered sequentially and the primary successor is examined
first. A primary descendant of the original problem (the root
node) is either a primary successor of the root, or the primary
successor of some primary descendant of the root.
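
The recursive definition of primary descendants can be sketched as a graph
traversal. In this minimal Python example, `t` gives the type of each
node as a successor and `R` gives each node's ordered successor list; both
names and the example graph are assumptions made for illustration.

```python
def primary_successors(node, t, R):
    """All OR successors, plus only the first AND successor, in order."""
    successors = R.get(node, [])
    or_successors = [k for k in successors if t[k] == "OR"]
    and_successors = [k for k in successors if t[k] == "AND"]
    return or_successors + and_successors[:1]


def primary_descendants(root, t, R):
    """Nodes reachable from root through primary-successor links only."""
    found, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        for k in primary_successors(node, t, R):
            if k not in seen:          # guard against shared subproblems
                seen.add(k)
                found.append(k)
                stack.append(k)
    return found


R = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
t = {"a": "OR", "b": "OR",
     "a1": "AND", "a2": "AND", "b1": "AND", "b2": "AND"}
print(sorted(primary_descendants("root", t, R)))
```

The second AND subproblems a2 and b2 are not primary: they are examined
only after the corresponding first subproblem has been solved.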

Recognition of a solved problem (primary terminal) triggers
the search, under the a priori constraints embodied in the PRR,
for the solution to problems for which the solved problem is a
primary successor. Typically, this would involve a top-down
(model-directed) search for the solution of other non-primary
successors. If the inverse of the primary successor relation
is available in the PRR, as is the case for CFG's and hence for
finite AG's, the analysis can proceed recursively in either
bottom-up or top-down direction.

For the feature extraction application, primary terminal nodes
represent the prominent morphs which can be reliably extracted from
the data without any syntactic information. Problems which are not
primary are always solved with respect to other morphs and properly
related syntax. (The data segmentor is only asked to do work that
is consistent with the global segmentation/interpretation being
maintained by the structural analyzer.)

Individual morphs are defined as constrained mathematical
curves and least-squares theory is the basis for morph detection
and quality evaluation. When recognized, each substructure of the
PRR must be assigned a quality $Q \le 1.0$ to reflect the confidence of
recognition. For primary terminals, Q is obtained from the morph
primitive detector itself. For secondary morphs, Q depends not only
on the quality of the detection but also on the degree to which the
detection satisfies the structural hypothesis. For a non-primitive
structure, the quality may be defined by the minimum quality of its
substructures. The merit of a path in model-space is defined as the
minimum quality of any structures identified along that path. The
ordered search for interpretations will find the highest quality one
first because it always extends the highest merit path first.
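
The claim that extending the highest-merit path first yields the best
interpretation first follows because merit, being a minimum, can never
increase as a path grows. A minimal Python sketch of such an ordered
search is given below. It is a generic best-first search over an acyclic
graph with a priority queue, not the NDA algorithm itself; the graph, the
quality values and all names are assumptions of this example.

```python
import heapq


def best_first(start, successors, is_goal):
    """Return (merit, path) for the highest-merit path to a goal, or None."""
    # heapq is a min-heap, so merits are stored negated.
    frontier = [(-1.0, [start])]
    while frontier:
        neg_merit, path = heapq.heappop(frontier)
        if is_goal(path[-1]):
            return -neg_merit, path        # first goal popped has maximal merit
        for nxt, quality in successors(path[-1]):
            # Merit of the extended path: minimum quality seen so far.
            merit = min(-neg_merit, quality)
            heapq.heappush(frontier, (-merit, path + [nxt]))
    return None


# Hypothetical model space: each edge carries the quality Q of the structure
# recognized when that step is taken.
edges = {
    "start": [("A", 0.9), ("B", 0.6)],
    "A": [("goal1", 0.5)],
    "B": [("goal2", 0.8)],
}
result = best_first("start",
                    lambda n: edges.get(n, []),
                    lambda n: n.startswith("goal"))
print(result)
```

The path through A starts out better (0.9) but ends with merit 0.5, so the
search returns the path through B, whose weakest structure has quality 0.6.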


Multiple interpretations and non-directionality are facilitated
by converting the PRR into an equivalent SSR such that partial
solution trees in the AG, i.e., partial interpretations of the data,
become encoded as states in SSR. Best-first search in this SSR can
be shown to produce the minimax (best) solution tree (interpretation)
in PRR. In practice, if the intervals of search for primitive
features are not tightly constrained by syntax, an inadmissible but
efficient heuristic detection strategy is to scan exhaustively only
once for morphs of certain minimum size, and then, using perturbation
operators in a state space search, grow each detection so long as
quality is acceptable. In any event, as in the multiclass pattern
classification work described earlier, the unifying concept between
the structural analysis and detection algorithms is that of
state-space search.

Because of the correspondence between CFG's and finite AG's,
the new NDA algorithm in [Stockman (1977a)] can be viewed either as
a "problem solver" or a parser. When applied to AG's representing
games, i.e., game trees, the algorithm appears competitive with,
although different from, the α-β tree pruning procedure [Nilsson
(1971)] in efficiently producing the minimax solution tree. Knuth
and Moore (1975) have procedurally defined the α-β method, analyzed
its performance under some distributional assumptions and shown it
to be optimal according to certain statistical criteria. Other
analyses appear in [Fuller, Gaschnig and Gillogly (1973)] and
[Newborn (1977)]. Apart from some simulations, no such analysis of
the NDA algorithm has been done.

In AG's an underlying assumption is that subproblems can be
solved independently. Levi and Sirovich (1976) defined Generalized
AND/OR graphs (GAG's) in which subproblem interdependence is allowed.
Such GAG's and other formulations of GAG's enlarge the potential
representations for which Stockman's NDA algorithm and conversions
to SSR may be defined and thereby enlarge the models available for
structural representation and analysis.

The above tutorial presentation was designed to enable access
to some of the literature and methodology behind some innovative
approaches to multivariate statistical classification and structural
feature extraction theory and practice, which are quite different from
what currently appears in statistical journals. Much recent work
in the statistical and engineering literature on pattern recognition
is along lines already covered in Kanal (1974).

The conference proceedings, books and edited collections,
surveys and reports on prospects listed in the bibliography ease the
task of covering the proliferating literature on pattern recognition
techniques and applications. [Agrawala (1976)] reprints two
historically important out-of-print reports by E. Fix and J.L. Hodges,
which motivated the later interest in NN methods. Far removed from
the pattern recognition practitioners' present concerns are the
fascinating, but rather difficult to follow, works of Ulf Grenander
(1976) on pattern synthesis, and William Hoffman (1976) on a Lie
transformation group theory of form perception and feature
extraction.

Of immediate concern are certain practical problems of
measurement complexity and error estimation. Waller & Jain (1976), using
a model [Abend et al (1965)] with first order nonstationary Markov
dependent binary features, showed that independence of measurements
is not a necessary condition either for the absence of the peaking
phenomenon of measurement complexity [Chandrasekaran & Jain (1977),
Van Ness (1977)] or for perfect discrimination. Van Campenhout (1977)
resolves the paradox of the peaking phenomenon by showing that in
a true Bayesian formulation it is attributable to improper
comparisons of statistically incomparable models. In practice the
phenomenon exists. In error estimation, Toussaint (1975), using a
non-parametric classifier, concluded that an estimator should be
formed by equally weighting the resubstitution and rotation
estimators, while McLachlan (1977), on the basis of asymptotic results
for parametric classification using multivariate normal populations,
suggests that very little weight should be assigned to the
resubstitution estimator. Clearly, specific examples may serve only
as counterexamples, not as the basis for drawing general
conclusions.
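
One way to read the two recommendations is as different choices of a
weight in a convex combination of estimators. The symbols below are
notation assumed here for illustration and are not taken from the cited
papers: $\hat{e}_R$ is the resubstitution estimate, $\hat{e}_{\mathrm{rot}}$ the
rotation estimate and $w$ the weight given to resubstitution.

```latex
\hat{e}(w) = w\,\hat{e}_{R} + (1 - w)\,\hat{e}_{\mathrm{rot}}, \qquad 0 \le w \le 1
```

Equal weighting corresponds to $w = 1/2$, whereas the asymptotic
parametric results would, in a combination of this form, favor $w$ close
to 0.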

In addition to the problems cited earlier, much remains to
be done on the decision-tree and state-space search formulation
of hierarchical classifiers, including the development of optimal
procedures for continuous and mixed random variables. A list of
general problems of automatic and semi-automatic pattern
recognition appears in Kanal (1977).

ABSTRACT

Noting the major limitations of the much developed
multivariate statistical and syntactic pattern recognition models, this
paper describes in a tutorial manner alternate representations,
based on state-space and AND/OR graphs and ordered search strategies,
for multistage and nearest neighbor classification and for
structural pattern analysis and feature extraction. Some recent work
in pattern recognition is reviewed from these vantage points.

In addition, the paper touches on recent contributions to the
continuing attempts to understand feature subset selection,
measurement complexity and nonparametric classification and error estimation.
Surveys, conference proceedings and edited collections providing
quick access to the recent literature on pattern recognition
methodologies and applications are cited in the bibliography.